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Abstract 

For the vast majority of genome wide association studies (GWAS) published so 
far, statistical analysis was performed by testing markers individually. In this arti- 
cle we present some elementary statistical considerations which clearly show that 
in case of complex traits the approach based on multiple regression or generalized 
linear models is preferable to multiple testing. We introduce a model selection 
approach to GWAS based on modifications of Bayesian Information Criterion 
(BIC) and develop some simple search strategies to deal with the huge number 
of potential models. Comprehensive simulations based on real SNP data confirm 
that model selection has larger power than multiple testing to detect causal SNPs 
in complex models. On the other hand multiple testing has substantial problems 
with proper ranking of causal SNPs and tends to detect a certain number of false 
positive SNPs, which are not linked to any of the causal mutations. We show 
that this behavior is typical in GWAS for complex traits and can be explained 
by an aggregated influence of many small random sample correlations between 
genotypes of a SNP under investigation and other causal SNPs. We believe that 
our findings at least partially explain problems with low power and nonreplicabil- 
ity of results in many real data GWAS. Finally, we discuss the advantages of our 
model selection approach in the context of real data analysis, where we consider 
publicly available gene expression data as traits for individuals from the HapMap 
project. 

Keywords: Genome wide association, Multiple testing, Linear regression, 
Model selection, mBIC 



1. Introduction 

Within the last five years genome wide association studies (GWAS) have 
become an important tool for genetic scientists. There exist several excellent 
reviews which elucidate the statistical intricacies involved, see e.g. (5j [30j 1351 H9] . 
The major goal of GWAS is to detect association between some trait (either 
quantitative or categorical) and some genetic markers. The most commonly used 
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type of markers are single nucleotide polymorphisms (SNPs). Current SNP array 
technology allows to determine the state of up to one million SNPs within a single 
experiment. 

The huge number of markers leads to a multiple testing problem which has 
been extensively discussed in the literature (see the discussion on this topic in 
[49J). It is common practice in applied papers on GWAS to report single marker 
tests of SNPs. For a review on statistical tests for case control studies we re- 
fer again to [49]. Recommended significance levels to control family wise error 
are as small as a = 5 ■ 10 -8 [23], though occasionally larger significance levels 
like a = 10~ 6 are used. It is well understood that due to positive correlations 
between markers a simple Bonferroni correction is likely to be too conservative. 
Approaches to deal with correlation between SNPs include permutation tests like 
in [37] or the use of Hidden Markov Models [3H] • 

Most GWAS are performed control studies, though recently there has 

been growing interest in GWAS for quantitative traits (QT) [33]. Statistical 
analysis performed for QT tends to make use of regression models to correct for 
covariates like sex or age; however, each SNP is then tested individually at a 
significance level accounting for multiple testing (see for example [2D 12E1 [37J, [3HJ 
132] etc.). 

In the fairly related area of QTL mapping based on designed breeding ex- 
periments the search over single markers was abandoned already quite a while 
ago in favor of multi marker models. In this context the problem of selection 
of significant markers is equivalent to the choice of the "best" multiple regres- 
sion or generalized linear model. This task is however rather difficult due to the 
large number of potential regressors. Specifically, in JTH] it was noticed that clas- 
sical model selection criteria like Akaike Information Criterion (AIC, [2]), and 
even Bayesian Information Criterion (BIC, [33]), tend to select too many mark- 
ers. Addressing this problem [T2] introduced a modified version of BIC (mBIC), 
suited for the situation where a large number of markers is searched, but only 
relatively few markers are expected to be true signals. mBIC was motivated in 
a Bayesian setting, using informative priors on the model dimension, which pre- 
fer rather small models. In [11] and [15] it was observed that mBIC penalty is 
closely related to the multiple regression version of the Bonferroni correction for 
multiple testing. In a series of papers [31 131 CD EE 1231 EI] based on simulation 
studies good properties of mBIC were documented. Recently some asymptotic 
optimality properties of mBIC have been shown [101 126] . 

In this article we will adopt a model selection approach for GWAS, which 
will be introduced in Section [2] For the ease of presentation we will restrict our 
discussion mainly to linear regression models, though qualitatively similar results 
will hold for generalized linear models. In Section 24] we present basic statistical 
considerations, which clearly demonstrate the advantage of using multiple regres- 
sion over the search over single markers. In this section we specifically stress the 
lower power of single marker tests. This results from an inflated residual sum 
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of squares, which incorporates the influence of all causal genes that are not in- 
cluded in the model. In Section 2.2 we introduce and motivate our particular 
choice of model selection criteria for the high dimensional multiple regression we 
are dealing with. Apart from mBIC we consider a second modification of BIC, 
mBIC2, which according to [26] is closely related to the multiple regression ver- 
sion of the Benjamini-Hochberg procedure [8] for multiple testing. According to 
[26] mBIC2 has asymptotic optimality properties in a wider range of sparsity 
parameters than mBIC. The greatest challenge in applying model selection to 
GWAS is the huge search space of potential models. In Section 2J3 we describe 
some model search strategies which are particularly suited to the situation where 
we have a huge number of markers but expect only a small number of them to 
be strongly associated with the trait. 

In Section [3] we perform a simulation study based on actual SNP data. Apart 
from the expected result that model selection strategies outperform multiple test- 
ing procedures the simulation study provides some rather surprising insights. In 
particular we will see that for multiple testing procedures rather small random 
correlations between causal SNPs have drastic influence on the order of p-values. 
This results in a low power to detect some important causal mutations, as well 
as in a large number of spurious detections. Thus, our results show that single 
marker tests can lead to many erroneous results, which makes the use of these 
procedures in the context of GWAS rather questionable. 

Finally in Section [1] we reanalyze publicly available gene expression data [321 
H6JI37] as quantitative traits for the individuals genotyped in the HapMap project 
[2H]- One aim of [37] was to detect association with gene expression levels of 
SNPs lying outside the region of the considered gene (trans regulatory SNPs). 44 
genes with trans regulatory SNPs were reported when pooling over all HapMap 
populations. Using our model selection approach we are able to increase the 
number of detected trans regulatory regions in several cases. 



2. Methods 

We first want to motivate why we advocate a model selection approach. The 
discussion in this article is in the context of linear regression models, which allows 
to focus on the principal ideas involved. We expect that similar conclusions as 
presented here will also hold for generalized linear models, and in particular for 
logistic regression, which might be applied in case control studies. 

2.1. Linear regression models 

Let yi,i G {1, • • • , n}, denote measurements of a quantitative trait in n indi- 
viduals. Assume there are p SNPs where p 3> n, but only ^ < n of them have 
an effect on y. Let j* = . . . ,jf), with 1 < < . . . < jf < p, denote the 
ordered set of indexes of causal SNPs. We denote the genotype of person i and 
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SNP G { — 1,0, 1} and assume that the additive model 

k 

My : yi = 0o + fiti x m + e i CO 
1=1 

holds. For the sake of simplicity we assume that the error terms are i.i.d. normal 
random variables ~ A/"(0,cr 2 ). In a more realistic scenario the model can be 
easily extended by including other covariates, like sex or age. 

The model proposed in is rather simple. However, complexity arises be- 
cause the task is to find this model among 2 P possible models (we consider only 
models including the intercept). This is a gigantic number taking into account 
that in GWAS we are usually dealing with p m 10 7 . The null model which does 
not include any causal SNP will be denoted by M.$. All further additive models 
can be characterized by multi indices Aij, where j is an ordered subset of ele- 
ments of the set {1, . . . ,p}. Generically we will write q = gj for the number of 
markers of a model. For the correct model we have gj* = k. 

Define for each model A^j the matrix X 3 = (1 ), where 1 = (1, . . . , 1)'. 

Then we have in vector notation 

M r . y = X^ i + e i (2) 

where /3j = ((3q, (3^, . . . , (3j q )' . Given the large number of markers it is under- 
standable that it is common practice to perform only single marker analysis. 
This means that only models of the form 

SMj : //, lu ■ '> r r, :! + (3) 

are considered. 

Elementary statistics tells us what happens when we perform an F-test for 
some model A^j based on least squares regression. Let Pj = (X 3 )'[X 3 (X 3 y]~ 1 X 3 
denote the usual hat-operator for a general model M.y Then RSSj := y'(I — Pj)y 
is the residual sum of squares and MSSj := - \E)y is the model sum of 
squares for M.y We always denote by / the identity matrix and by E = 11' 
the all one matrix of suitable dimension (here they are n x n). The usual F-test 
statistic for the null hypothesis that none of the variables in the model Aij has 
an influence on Y is given by 

(n - gj - 1)MSS } 
qjRSSj 

When the model A4j includes all causal SNPs then the statistics Fj has a non- 
central F-distribution and power calculations for different effect sizes are rather 
straight forward (compare results for j = 1 in Figure [T]). However, we are often 
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facing a different situation when model .Mj* holds, but we are performing an F- 
test for a smaller model .Mj, which might not include some of the causal SNPs. 
Then 

'=i ;ej*\{j} 
and according to the generalization of Cochran's theorem for the noncentral case 
[33] the model sum of squares and the residual sum of squares are independent 
with distributions 

M c V- 1 1 

~ x\%-A*(X r Y(Pi--E)X^1), (4) 

pec 1 

— ^ ~ X \n-q- i -l,-^{X^)\I-P- i )X^- r ). (5) 

Here x 2 { u i v ) denotes a noncentral chi-square distribution, with the number of 
degrees of freedom equal to u and a noncentrality parameter v. Thus the test 
statistic Fj is essentially the ratio of two independent noncentral x 2 -distributed 
random variables. If the size of the true model q^* is much larger than the size gj of 
the model under consideration, then the residual sum of squares RSS- } will have a 
considerably large noncentrality parameter incorporating effects which have not 
entered the model, and the power of the according F-test will be comparably 
small. 

This effect will be most pronounced in case of simple regression models ([3]). 
To fix ideas consider for a moment the orthogonality assumption (X**yx** = nl. 
Then the noncentrality parameters corresponding to MSSj and RSSj respec- 
tively become 

"M,j = and urj = — T ( 6 ) 

'ej*\{j} 

In Figure [T] power calculations are shown for this simplified situation with n = 
2000 and a = 10~ 6 . The squared scaled effect sizes r = are equal for all k 
effects. Power was obtained by sampling from the two noncentral ^-distributions 
of Q and ([5]). If there is only a small number of causal SNPs the loss of power 
by testing for individual markers is not dramatic. However already for k = 10 
the loss becomes recognizable, and for k = 30 one is actually losing more than 
50% of power in the range of effect sizes we considered. 

Now in GWAS one can certainly not expect that all causal SNPs have the 
same effect size, and their genotypes will also not be orthogonal. However, GWAS 
are performed to understand the genetics of complex traits, which per definition 
are influenced by more than one factor. Therefore our considerations concerning 
loss of power by single marker analysis will apply. For a single effects model SM. j 
one obtains 

WW} v { j) J 
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Figure 1: Power to find a causal SNP with single marker tests when model Aiy with k effects 
is correct. In case of k = 1 one is testing for the correct model. 



where Xj is the sample mean and Var(xj) and Cov(xj, x{) are the sample variance 
and covariance respectively. The noncentrality parameters for single marker tests 
have the form 

fejLi f3iCov(xj,xi)) 
"M,j = S z^JZT\ — - (7) 



and 



EV- PlPr 



Cov(xi, X r ) 



Cov(xi, Xj)Cov(x r ,Xj 
Var(xj) 



Compared to the case of orthogonality in (JHJ things are slightly more com- 
plicated due to correlation effects, which might have a strong influence on the 
noncentrality parameters for RSSj and MSSy In section [3] we will show that 
the joint influence of many small random correlations between causal SNPs has a 
very strong effect on the noncentrality parameters vm ■ This results in substantial 
problems with ranking of p- values and leads both to a low power of detection of 
some of the causal SNPs as well as to a relatively large number of false positives. 
Also, the general problem of loss of power when testing for single markers under 
a complex true model remains the same. We thus feel justified to claim that a 
model selection approach is the favorable alternative to single marker analysis. 

2.2. Modifications of BIG 

Assume that a family of models J\4- } has parameters 9j and corresponding 
likelihood functions L-A9C). Denote by 9\ the maximum likelihood estimates of 9\. 
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Many statistical model selection criteria, like for example AIC or BIC, suggest 
to select that model which maximizes a penalized likelihood function of the form 



logLj^)-^. (9) 

For AIC and BIC the penalty parameter rj takes the form 1 and | log n respec- 
tively. 

For linear regression under the assumption of normal error terms e J ~ A/"(0, a 2 I) 
the likelihood function of each model Aij in (|2j) is given by 

lmm = ^^ex P (Jy- xi my-m\ 

Jiy| J ; (y/2^a) n V 2a 2 J 

The maximum likelihood estimator of /3j then coincides with the least squares 
regression estimator (3- } and thus 



For fixed a BIC is then equivalent to minimize 



1 gj logn . (10) 



a 2 



For unknown o the ML-estimate <r 2 = leads to the criterion 

n 

n\ogRSS- } + qjlogn . (11) 

For sample size n > 8 the penalty parameter in AIC is smaller than in BIC 
and therefore in this case BIC tends to select more parsimonious models than 
AIC. Also, it is known that when p is fixed and n goes to infinity then BIC is 
consistent. Thus, when n is large and p<n, BIC usually selects the true model 
with large probability. However, the situation is very different in case of p > n. 
As explained in detail in |15] BIC is derived with the underlying prior assumption 
that all possible models Aij are chosen with the same probability. This effectively 
results in using a Binomial prior B(p, 1/2) on the model dimension. Thus, BIC 
assigns a high prior probability to the class of models of size p/2, whereas small 
or very large dimensions are much less likely a priori. Under sparsity, where the 
actual model has only a small number of regressors, this results in BIC choosing 
too many regressors. 

As a remedy for this situation a modification of BIC was introduced in [T2] , 
which can be formulated as 

mBIC: -21ogLj(0j) + gj(logn + 21ogp + d) . (12) 
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This criterion was derived in a Bayesian setting assuming a prior probability of 



the model M.^ of the form 



7r(j) =u*{l-uy-* 



In our context o> can be interpreted as the a priori expected proportion of causal 
SNPs, where all SNPs have independently from each other the same chance of 
being causal. This is a typical prior assumption in Bayesian model selection (see 



e.g. |20j). Incorporating this prior distribution into BIC we easily obtain (12) 
with d = —2 log (pa;), i.e. minus two times the logarithm of the expected number 
of causal SNPs (for details of this derivation see [T2]). 

In case of known a and under the assumption of orthogonal regressors mBIC 
has been shown to be closely related to the Bonferroni correction rule for multiple 
testing [13]. In particular mBIC is controlling the family wise error. In [2E] 
it is shown that under certain sparsity conditions mBIC is consistent and has 
some optimality properties. Furthermore mBIC has been studied in the context 
of Generalized Linear Models [52j as well as Zero Inflated Generalized Poisson 
Regression [21]. 

Recently, in [T7] a new modification of BIC, extended Bayesian Information 
Criterion (EBIC), was proposed. EBIC assigns a prior probability for the model 
dimension q, which is proportional to for some K G [0, 1]. This results in the 
criterion 



EBIC: — 21ogLj(0j) + q- s \ogn + 2 log 

If k — 1 then EBIC coincides with BIC. The choice k = corresponds to the 
uniform prior on the model dimension. In [T7] some consistency properties of 
EBIC are proved under the assumptions that the maximal dimension searched 
by EBIC is fixed and larger then the true number of effects. In [18] EBIC was 
further extended to Generalized Linear Models and in |53j it was successfully 
used for GWAS with binary traits. 

While EBIC turns out to work very well in many practical sparse cases, it 
has one undesirable property. When q> % the last term of the penalty becomes 
a decreasing function of q and encourages to pick the largest possible model. 
Therefore in this article we will consider a slightly different criterion, 

mBIC2: - 2 logLj(0j) + £j(logn + 2 logp + d) - 2 logfo!) , (13) 

which is asymptotically equivalent to EBIC with k = when the maximal allow- 
able number of regressors, Q, is of the order Q = o(p). 

mBIC2 was developed in [26] as a model selection rule which in the context 
of multiple regression works similarly to the Benjamini-Hochberg correction for 
multiple testing. In [26J a thorough discussion is provided how this modification of 
mBIC relates to a similar criterion suggested by [1] as well as to a modification of 
the risk inflation criterion RIC, proposed in [21]. Due to the negative extra term 
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mBIC2 will potentially select larger models than the original mBIC (12). In [26] 
it is shown that mBIC2 has asymptotic optimality properties for a much larger 
range of sparsity levels than the original mBIC. We will compare the behavior of 
both criteria to select causal SNPs in the simulation study of Section [3} 

2.3. Search algorithm 

An important question when applying a model selection approach to GWAS 
data is how to deal with the gigantic number of possible models. Some interesting 
search strategies for the best multiple regression model were recently proposed 
e.g in [53] and [H]. However, these advanced model selection strategies are rather 
unfeasible for the large scale simulation studies. Therefore, for the purpose of our 
simulation study we developed our own search strategy, whose initial step relies 
on some modification of the popular forward selection. Our method takes into 
account the fact that we are expecting a rather moderate number of causal SNPs 
(somewhere below 100) and turns out to be relatively accurate and fast enough 
to allow for a simulation study based on more than 300000 SNPs. 

In an initial step we perform single marker tests for all SNPs, a step we have 
to take in any case to be able to compare our model selection approach with 
the popular Bonferroni and Benjamini Hochberg (BH) procedures. For further 
analysis we only consider SNPs with an uncorrected p- value smaller than 0.15. 

The second step consists of a simplified forward search strategy which we 
call multiple forward search. To this end we start with computing the original 



BIC (11) for the single marker model with lowest p- value. Then we proceed 
iteratively by considering SNPs in ascending order of single marker p-values and 
decide based on BIC to enhance the current model by a new SNP or not. This 
procedure is performed till we have considered all SNPs, or we have reached a 
maximum model size of 140 SNPs. For practical reasons we do not allow for 
larger models at this stage. 

The initial multiple forward selection, based on the uncorrected BIC, is ex- 
pected to include a lot of false positives in the model, but hopefully also many 
causal SNPs. Its principal advantage is that the actual search procedure, based 
on modifications of BIC, is not starting from the null-hypothesis, but from a large 
model for which the residual sum of squares will have been considerably reduced. 
From here we start to perform backward selection and then stepwise selection 



based either on mBIC (12) or on mBIC2 (13). This search strategy is designed to 
overcome the difficulties discussed in Section 2.1 when the actual model includes 
a large number of causal SNPs. It works well in the simulation study of Section 
[3j where more time consuming search procedures are out of question. In the real 
data analysis of Section [4] the procedure will be amended with a final step, where 
all subsets of a specified set of SNPs are considered. 
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3. Simulation study 



Simulations are based on SNP data from the population reference sample 
POPRES [38] - The dataset used for simulations in this manuscript was obtained 
from dbGaP through dbGaP accession number phg000027.v2.p2 at 
www. ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000145. vl.pl. 
In particular we used data from the file Glaxo.txt, which contains a subsample of 
individuals studied in the article [22]. It comprises genotypes from 309788 SNPs 
for 649 individuals, which all have European ancestry and represent a relatively 
homogeneous population. 

In this data set approximately 5% of genotype values were missing. To deal 
with these missing values we adopted the following imputation strategy. Suppose 
Xij, the genotype of SNP j for the i-th individual, is missing. We search for the 
4 SNPs with strongest correlation to SNP j fulfilling two conditions: They are in 
a neighborhood of 500 SNPs upstream or downstream of SNP j and their values 
for the i-th individual are not missing. If we find individuals who have exactly 
the same values as the i-th individual on these 4 SNPs, then we predict the value 
of as the most frequent value of SNP j among these individuals. If we cannot 
find individuals fulfilling the above mentioned condition, then the most frequent 
value of the jth SNP among all individuals is imputed. 

Since the main purpose of this experiment is to present some basic properties 
of different approaches to GWAS, both our simulation and search procedures 
treat the final set of obtained SNP genotypes as the "correct" one. Therefore, 
the imputation procedure has no influence on the final results. 

Among the p = 309788 SNPs we have chosen k = 40 SNPs from autosomal 
chromosomes to be causal. These were selected deliberately in such a way that 
they are common and well distributed over all chromosomes. The minimum allelic 
frequency for all causal SNPs was ranging between 0.3 and 0.5; variance of their 
genotype data was ranging between 0.42 and 0.53; and correlations between all 
possible pairs of causal SNPs was between -0.12 and 0.1. For the considered 
sample size this range of sample correlations corresponds well to the range of 
random sample correlations between independent SNPs. 

We simulated 1000 replicates from the additive model Q where j* indicates 
the 40 causal SNPs. Error terms were sampled from a standard normal distribu- 
tion, i.e. cr 2 = 1. The 40 effect sizes were equally distributed between 0.27 and 
0.66. The overall heritability, defined as 

= 

1 + Var(XJ*/3j*) ' v ; 

is equal to 0.81. Heritability of an individual effect considered, defined as 

/??.Varfo.) 
1 + Var(X>*# 



h r = i ,\r ^ > ( 15 ) 
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ranges between 0.006 and 0.037. 

We are aware of the fact that the overall heritability is unrealistically large, 
but we consider it instructive to present the difficulties of the multiple testing 
procedures, which occur even in this simplified setting. We believe that the 
phenomena discussed further in this paper, which result from a large number of 
causal SNPs, can play a role in explaining the problem of 'missing heritability' 



in GWAS [M], a point which we extensively discuss in Section 3.2 



Each simulated data set was analyzed using multiple testing procedures (Bon- 
ferroni and Benjamini Hochberg) as well as model selection approaches based on 
mBIC and mBIC2. Bonferroni multiple testing correction was performed at fam- 
ily wise error rate a = 0.05, which corresponds to an adjusted significance level 
of approximately 1.6- 10 -7 . Benjamini Hochberg procedure was performed at the 
correponding FDR level a = 0.05. Model selection with mBIC was based on the 
constant d = — 21og(4), which serves as a standard choice (see e.g. [15]). Based 
on the calculations of [12] we expect that for p and n of this data set mBIC con- 
trols the family wise error under the total null approximately at a level a = 0.02. 
We have also computed Bonferroni correction and BH at this smaller level, but 
given the observed lack of power of both BH and Bonferroni these results are not 
presented. 

In GWAS studies it frequently happens that not the causal SNP itself is 
detected as significant, but some SNP whose genotype is highly correlated to 
the causal SNP. Such a finding is not necessarily to be considered as a false 
positive, which leads to the question how to define true and false positives for 
correlated regressors. We adopt the following convention: Any detected SNP 
whose correlation to a causal SNP has absolute value larger than a given threshold 
is counted as a true positive, otherwise as a false positive. We initially used 
a threshold of \R\ = 0.9, and based on simulation results decided to report 
alternatively also results for a threshold \R\ = 0.7. Here \R\ is the maximum 
absolute value over all correlations with causal SNPs. 

For multiple testing procedures it frequently happens that two or more se- 
lected SNPs are correlated with a causal SNP. In case they are above the speci- 
fied threshold they are all counted as just one true positive. False positive SNPs 
with identical genotype are only counted once. Based on these definition we es- 
timate the power of detection for each individual causal SNP, as well as the false 
discovery rate (FDR). 

The graphs in Figure [2] show the observed FDR values for all 1000 replicates 
based on the thresholds \R\ = 0.9 and \R\ = 0.7 respectively. Both model 
selection procedures show much less variation in FDR than BH, which is a direct 
consequence of the fact that the model selection procedures have much larger 
power. The false discovery rate of BH at level 0.05 is apparently larger than 
FDR of mBIC2 with the constant d = —2 log 4, though in absolute terms the 
number of false positives is often smaller for BH. The choice of the threshold 
for a "false positive" has a strong effect on FDR. For all procedures FDR is 
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Figure 2: Observed FDR of the 4 different selection procedures. A detected SNP was classified 
as a true positive when its maximal correlation to a causal SNP was larger than 0.9 in the first 
graph, and larger than 0.7 in the second graph. 



significantly reduced when using the more liberal criterion \R\ = 0.7. In particular 
the Bonferroni procedure detects in that case hardly any false positives. The effect 
of the threshold on the number of false positives and on power is discussed in 
more detail in Section 13.11 

The graphs in Figure [3] provide the estimated power to detect each of the 40 
causal SNPs at the threshold levels \R\ = 0.9 and \R\ = 0.7 respectively. The x- 
axis shows the individual heritability (15) of each SNP. It is evident that mBIC2 
has the largest power among the 4 different procedures. Also, as expected, mBIC 
has in most cases larger power than the two multiple testing procedures. 

Most remarkable is the dependence of power on the individual heritability. As 
expected there is a general trend that larger individual heritability yields larger 
power, but Figure [3] shows that there is also a huge amount of irregularity. For 
mBIC2 in general the association between individual heritability and power is 
quite strong. Still, when using threshold \R\ = 0.9, there are several SNPs with 
relatively small power. The most striking example is SNP A-1912140, with a 
large heritability of 0.03 and power of approximately only 33%. This specific 



case will be explained in detail in Section 3.1 



For the multiple testing procedures the behavior is much more erratic. The 
order of SNPs with respect to power is entirely different from the order in terms of 
individual heritability. For example, the SNP with the largest heritability, easily 
detected by mBIC2 and mBIC, is completely missed by the procedures based 
on individual tests. On the other hand, some substantially "weaker" SNPs, are 
detected with a power exceeding 50%. The key to understand this phenomenon is 
the influence of correlation between causal SNPs on the noncentrality parameter 



of the model sum of squares, as discussed in Section 3.2 We will see that this 
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Figure 3: Power of the 4 different selection procedures. 



has not only a negative effect on detecting causal SNPs, but it also gives rise to 
numerous detections of false positive SNPs which have no relation at all to the 
quantitative trait. This result we consider to be the most important outcome of 
this simulation study. 

3.1. Dependence of TP and FP on \R\ - thresholds 

Table [T] shows how often certain SNPs occur as false positives based on the 
threshold \R\ = 0.9 for mBIC2 and for BH. The first significant finding is that 
SNP A-2299101, detected in 670 simulation runs, has a correlation of 0.8958 with 
causal SNP A-1912140. This explains the low power of 33% to detect causal SNP 
A-1912140, and we conclude that SNP A-2299101 is actually not really a false 
positive, but rather it is detected instead of SNP A-1912140. Thus the threshold 
\R\ = 0.9 to determine false positives is apparently too strict. 

Looking at the first column of Table [T] we observe that all SNPs detected 
by mBIC2 in more than 5 simulation runs have \R\ > 0.5. Since none of the 
statistical approaches to GWAS can clearly distinguish SNPs which are closely 
correlated, we believe that instead of reporting just one detected SNP, one should 
report also all SNPs which are strongly correlated to it. The smaller the suitable 
threshold the more SNPs one has to report, and to this end the value 0.5 appears 
to be fairly small. As a compromise we decided to consider \R\ = 0.7 as a suitable 
threshold. With the exception of SNP A-4270622 all SNPs which are detected 
by mBIC2 in at least 18 simulation runs are then classified as true positives. 

On the other hand there is a relatively large number of SNPs which are 
detected by mBIC2 as false positives only once (1076), twice (55) or three times 
(12). These detections, which might be classified as actual false positives, are 
usually not correlated to any causal SNP. Based on a model selection approach 
one would expect to detect some false positives of this nature, and their frequency 
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Table 1: Thirty most frequent false positive SNPs (based on \R\ < 0.9) under mBIC2 and 
under BH. First the SNP name, then the frequency in how many simulation runs the SNP was 
detected as a false positive, and finally the absolute correlation \R\ to the closest causal SNP. 
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is controlled according to the theory of mBIC2. 



3.2. Multiple testing procedures and heritability 

The second column of Table [T] shows the thirty most frequent false positive 
SNPs under BH. There are several SNPs which coincide with SNPs from the first 
column. Practically all other SNPs which have not been detected by mBIC2 as 
false positives have a striking characteristic: They are not strongly correlated to 
any causal SNP. The most prominent example is SNP A-2181789, which has been 
detected 354 times as a false positive, but is only correlated with \R\ = 0.2628 to 
the closest causal SNP. 

We will now provide the explanation for this seemingly implausible result, and 
will also explain the erratic behavior of power in terms of individual heritability 
observed in Figure [3] Remember that F-tests of single effect models involve 
non-centrality parameters ^ and (|8]). We can rewrite the square root of Q as 



3 j 



This shows that in case of orthogonal design matrix, %j is proportional to the 
individual heritability. However, in the general case the non-centrality parameter 
is modified according to Y^i^j 0iCov(xj,Xi). This term can occasionally become 
fairly large when there is a large number of true signals. Note that we designed 
our simulation study such that pairwise correlation between SNPs was small. In 
a statistical sense the genotypes of the causal SNPs can be thought of as being 
independent. Still, for some causal SNPs the effects of correlation just by chance 
accumulate significantly. 

The first plot in Figure [4] shows that small pairwise correlations with other 
causal SNPs explain the erratic behavior of the multiple testing procedures. 
When we plot the power against the square root of the non-centrality param- 
eter %j we observe the regular behavior of a sigmoid function. Clearly not the 
individual heritability but the weighted sum of correlations to all causal SNPs 
from ([7]) determines the power to detect an individual SNP. This observation is 
crucial. It calls into question the practice of reporting detected SNPs according 
to the order of p- values from multiple testing procedures and claiming that SNPs 
with smallest p-values are the most important ones. It might as well be the case 
that such signals are just catching the effect of many other causal SNPs which 
themselves are not detectable. Also, since most of the detected pairwise corre- 
lations between "false" SNPs and the trait result only from random fluctuations 
of sample correlation coefficients between these SNPs and the causal ones, they 
are not replicable in different samples from the same population. Therefore, such 
"false positives" are useless also in terms of predicting the trait values. 

The second plot in Figure [4] gives the answer to the question why some SNPs 
occur so frequently as false positives when they are not at all correlated with any 
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Figure 4: First plot: Power of the multiple testing procedures. Instead of the individual 
hcritability we plot against the square root of the noncentrality parameter. Second plot: Square 
root of the noncentrality parameter of correlations for all false positives occurring under BH 
(at level \R\ = 0.9). On the x-axis SNPs are ordered according to their frequency of detection 
in the 1000 simulation runs. The initial first stars from the left correspond to the SNPs listed 
in the second column of Table [T] 

causal SNP. It turns out that all false positive SNPs under BH have relatively 
large noncentrality parameter (tJvmj > 0.23), and in particular those 30 SNPs 
listed in the second column of Table flj all have y/vuj > 0.32. This is just the 
onset of the sigmoid function observed in the first plot of Figure [4] where the 
power to detect causal SNPs no longer vanishes. 

The conclusion from this analysis is the following. If we believe that a trait 
in a genome wide association study is influenced by a relatively large number of 
genes, then multiple testing procedures have a large chance of missing many of 
these genes. On the other hand there is a large chance of detecting false positive 
SNPs which have nothing to do with functional regions. This appears to be 
quite a devastating resume for the performance of multiple testing procedures in 
GWAS with complex traits. 

4. Real data analysis 

In a series of papers Stranger et al. [HI H2J HZ] have analyzed the association 
between SNPs from the HapMap project [29J and gene expression data. In par- 
ticular in [1Z] they considered 270 individuals from four populations, namely 30 
Caucasian trios of northern and western European background (CEU), 30 Yoruba 
trios from Ibaden, Nigeria (YRI), 45 unrelated individuals from Beijing, China 
(CHB) and 45 unrelated individuals from Tokyo, Japan (JPT). The major objec- 
tive of jlZ] was to find so called cis associations and trans associations between 
SNPs and gene-expression, where the cis region for SNPs was defined to be 1-Mb 
upstream or downstream of the expression probe midpoint. 
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For statistical analysis [U] used a permutation test approach for test statistics 
obtained with simple linear regression models only considering additive effects. 
Excluding the 60 children from the CEU and YRI trios left 210 unrelated individ- 
uals. Analysis was performed considering the four populations separately, as well 
as combining data from individuals of certain populations. Pooling over all four 
populations provides the most powerful approach, and we will thus restrict our 
analysis to this situation. To deal with population structure [UJ used a procedure 
based on conditional permutations, which was originally described in |31j . 

We want to compare our model selection approach in particular with the 
analysis of [UJ for trans association of gene expression with SNPs. Pooling all four 
populations they found 44 genes showing trans association, where detailed results 
are provided in their Supplementary Table 6. To make our results comparable 
with [UJ we also restrict ourselves to additive models of the form Q, though 
we believe it would be interesting to consider additionally dominance effects. 
To account for population structure in our models we add dummy variables for 
populations CEU, CHB and JPT. 

In [U] only some 25000 candidate SNPs were considered as putatively func- 
tional SNPs. Unfortunately these SNPs were not clearly specified (the corre- 
sponding link in the supplementary material is not providing the relevant list 
of SNPs), and we decided to search over all available SNPs. As a consequence 
mBIC2 will use a much larger penalty for multiple testing, on the other hand 
functional SNPs might be found which were not at all considered by [47] . 

Starting point was the set of SNPs from phase 2 of the HapMap project. In 
accordance with [UJ we only considered SNPs with MAF > 0.05, which results 
in a set of 2698476 SNPs. Filtering identical SNPs yields a subset of 2145627 
SNPs, many of which are strongly correlated. In that situation many of the SNPs 
do not bring substantially new information to the genotype data. To solve this 
problem in [11] the notion of 'effective number' of markers was introduced, an idea 
which can be also found e.g. in [3D]. The 'effective number' of markers is used to 
replace the number of available regressors in the penalty for modifications of BIC. 
In our real data analysis the 'effective number' of SNPs is calculated based on 
the clustering algorithm described in [25]. This algorithm yields clusters of SNPs 
which all have pairwise correlation above a chosen threshold C. In accordance 
with results of Section [3] we have chosen C = 0.7, which leads to an effective 
number of 780675 SNPs. 

We performed model selection based on mBIC2 ( |13[ ) with p = 780675 and we 
applied the search algorithm described in Section |2.3| We observed that in some 
cases the stepwise selection procedure got trapped in local minima for models 
which are too large. Therefore we added a final step to the search strategy, where 
we performed an all subset selection over the combined set of SNPs detected by 
mBIC2 and by [UJ. If this set of SNPs was excessively large (> 25) we performed 
backward selection, and all subset selection only for models including less than 5 
SNPs. Table [2] shows a comprehensive summary of these results. 
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Table 2: Summary statistics for the 44 genes reported in [47]. Col. 2: Number of tag SNPs 
representing detections by [47] for cis and trans association {original number in curly brackets}. 
Col. 3 and 4: Number of SNPs detected by mBIC2 as well as number of matches with and 
without taking into account population structure [number of cis SNPs within box]. Col. 3 also 
shows p-values of F-Test for dummy variables. Col. 5: Categories of genes as described in the 
main text. 
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As discussed in Section 3^ SNPs found by model selection are naturally rep- 
resentatives of a number of correlated SNPs. On the other hand many SNPs 
detected by the multiple testing approach from [UJ are strongly correlated and 
frequently even have identical genotype for all individuals. To make results com- 
parable we selected representatives of correlated SNPs from [37J by applying 
Tagger [6], a tag SNP selection algorithm implemented in Haploview [7]. In ac- 
cordance with the discussion in Section 3.1 we used as threshold \R\ = 0.7 (i.e. 
R 2 = 0.49). 

Table [2] provides the number of tag SNPs for cis and for trans associations for 
the 44 genes with trans association reported in [17] . Furthermore we provide the 
number of cis and trans association detected when using mBIC2, first for models 
with dummy variables corresponding to different populations, then for models 
without such dummy variables. We also report the number of matches between 
[4"7] and our model selection approach, where we define that a match occurs when 
the absolute correlation between a tag SNP and a SNP detected by mBIC2 is 
larger than 0.7. 

For models considering population structure we report p-values of F Tests 
on the overall effect of the 3 dummy variables. Taking into account population 
stratification is important. Without including dummy variables in most genes 
the number of detected trans SNPs is inflated. Almost all of these additional 
findings are associated with population structure, which corresponds well with 
the small p-values observed in column three. For most genes the expression levels 
vary between populations, and among the huge number of SNPs there will always 
be some which pick up this variation. 

Results are arranged according to the categories presented in the last column. 
In category A we collect 19 genes where the model with dummy variables found 
all SNPs from [UJ, in the 12 genes of category B it did not find any of them, and 
in the 6 genes of category A/B it detected some but not all of them. Category 
C collects 7 genes for which [UJ reports an extraordinary large number of SNPs. 
When we take into account population stratification, then for the majority of 
genes of category A and A/B our results are quite similar to those of [47J. In 
category B there are 7 genes for which Stranger reported 1 or 2 associated SNPs 
which were not detected using mBIC2. This is not surprising given the fact that 
we were penalizing for a much larger number of markers. 

On the other hand, results for the genes hmm25278-S, hmm26651-S, hmm34610- 
S and hmm32074-S are very interesting. These genes are located on chromosome 
1, 20, 8 and 6 respectively, but their expression levels are strongly correlated 
(pairwise correlation larger than 0.92 for each possible combinations). [UJ re- 
ported the following trans SNPs: rs9528181 for all four, rs7318180 for the first 
three, and rsl2860901 for the first two or them. These SNPs are all located on 
chromosome 13 at positions 113893447, 113835272 and 113901892, respectively, 
which indicates that this region on chromosome 13 has strong regulatory influence 
on the four genes under discussion. 
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Table 3: Models selected for the genes hmm25278-S, hmm26651-S, hmm32074-S and 
hmm34610-S. SNP name (first line), chromosome and position (second line) for each selected 
SNP. 
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rsl3021147 
2 107939438 


rs3761945 
1 228773391 






5 more SNPs on 
Chr. 5, 9, 10, 18, 22 


rsl7326215 
7 24408655 



For all these genes model selection is finding larger models, which are sum- 
marized in Table [3j All four models include a SNP on chromosome 13 which rep- 
resents the detection of |47] . Furthermore all four models include trans SNPs on 
chromosome 2 and on chromosome 3, though not all of them are located in prox- 
imity. Apart from that there is a certain amount of ambiguity. For hmm25278-S 
there is one more SNP which does not correspond to any of the other detections. 
Models for the other three genes agree on SNP rs2044109, and they all include 
a SNP on chromosome 1. The models of hmm32074-S and hmm34610-S include 
further non-corresponding SNPs. In summary, according to our study there is 
strong evidence that more than one region has regulatory influence on the expres- 
sion levels. Also this example shows that for the future a multivariate approach 
taking into account the information of correlated traits might be of some interest. 

For the genes discussed above the three trans SNPs detected by [47] are lo- 
cated very close to each other on the same chromosome. This is actually typical: 
For all 44 genes, the reported trans SNPs are located within a relatively small 
region. This holds even for the 7 genes of category C, which are characterized 
by an untypically large number of SNPs reported in (47]. These SNPs have a 
rather complex correlation structure, but their positions are for all cases within 
less than 400 kb. 

If we take for example gene GL19557676-S, the reported cis SNPs (chromo- 
some 6, between pos. 31105671 and 31439808) and trans SNPs (chromosome 6, 
between pos. 30045241 and 30049163) are located fairly close to each other. One 
might think of an extended cis region, and mBIC2 is finding 3 SNPs (2 cis, one 
trans) which represent the genetic variability within that region. Although the 
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trans SNP rs3823342 (chr. 6, pos. 30021046) found by model selection is not 
a match according to our definition based on correlation, it indicates the same 
region. The same is true for gene GL33469144-S, where SNP rs2996607 on chro- 
mosome 10 is in the same region as all the trans SNPs reported by (47], though 
based on the correlation criterion it does not count as a match. 

If we look at the genes GL10864076-S (Chr. 16) and GL23510353-S (Chr. 
19), in both cases mBIC2 found one matching trans SNP which turns out to 
be strongly correlated with all SNPs reported by [47], namely \R\ > 0.49 for 
GL10864076-S and \R\ > 0.48 for GL23510353-S. Now interestingly, for genes 
GL37537711-S (Chr. 5), GL41150880-S (Chr. 18) and GL42657060-S (Chr. 4) 
the trans SNPs found by (47j are all lying exactly in the same region as those 
of GL10864076-S and GL23510353-S (Chr. 6, between position 32500000 and 
32800000), and also many trans SNPs are actually shared by these genes. It is 
clear that this region must have a particularly strong regulatory effect on other 
genes, that is susceptible to genetic variability. Multiple testing strategies pick 
up many correlated SNPs reflecting these signals, whereas mBIC2 is detecting a 
smaller number of SNPs representing that region. 

Finally we want to mention several other genes for which additional trans 
SNPs have been found, namely GI 21536317-S, GI 31543145-S, GI 41147791-S, 
GI 42659564-S, GI 42662536-S, Hs.514777-S and Hs.517172-S. Perhaps most re- 
markable among those are GI 41147791-S and GI 31543145-S, where the model 
selection approach was able to detect a cis SNP which was not detected by mul- 
tiple testing. 

5. Discussion 

We have introduced a model selection approach for genome wide association 
studies using modifications of BIC which are based on sound theoretical con- 
siderations [151 [26] . Elementary statistical arguments have shown that a model 
selection approach is preferable to multiple testing strategies, and a comprehen- 
sive simulation study confirmed this. Finally we performed a real data analysis 
based on 210 individuals from the HapMap project, where model selection pro- 
vided some interesting detections not found by the original analysis based on 
multiple testing. 

Perhaps the most important result we obtained is that under complex models 
one cannot trust the order of p-values from single marker models. Test statistics 
are highly influenced by random small correlations to causal SNPs, leading on 
the one hand to a large number of false positives, on the other hand to severely 
reduced power. This loss of power might be one aspect of the widely discussed 
phenomenon of missing heritability in GWAS (for a recent discussion see [34]). 
It is believed that missing heritability might be found in rare SNPs, or that 
epigenetic effects might play an important role [30] ■ However, our results indicate 
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that the statistical analysis performed is an important aspect of the problem, and 
that multiple testing strategies are just not really well suited for GWAS analysis. 

In our simulation study model selection performed unambiguously better than 
multiple testing. In real data analysis, compared to the original analysis, a sub- 
stantial number of new putative regions of trans association could be found. Still, 
the effects were not as strong as in the simulation study; the largest model in- 
cluded 11 SNPs, two models were of size 7 and 5, the rest of size 4 or smaller. 
We believe that this is mainly due to the rather small sample size. To select 
more complex models one would need studies with a larger number of individ- 
uals. Then it is expected that differences between multiple testing and model 
selection are getting even more pronounced. 

To deal with the huge number of potential models we introduced a rather sim- 
ple search strategy designed for this particular application. Our search strategy 
served well in the simulation study, but it had some limitations in the real data 
analysis. The focus of this manuscript was not on search strategies. Modifica- 
tions of ideas presented in [3] in the context of QTL mapping might be useful. 
Other possible approaches have been discussed in [531 EE] • The exact choice of 
model search strategies in GWAS is certainly a fruitful topic for further research. 
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